Background

Sample covariance

Sample covariance between variables \(x=x_t\) and \(y=y_t\):

\[ c_{xy} = \frac{1}{N} \sum_i^N (x_i-\overline{x}) (y_i-\overline{y}) \]

Sample cross-covariance function for positive values of lag between variables \(x_t\) and \(y_{t+k}\) (Chatfield, The Analysis of Time Series, 2004):

\[ c_{xy}(k) = \frac{1}{N} \sum_{t=1}^{N-k} (x_t-\overline{x})(y_{t+k}-\overline{y}) \]

Sample correlation

Pearson’s correlation coefficient (sample correlation) is defined as the covariance of two variables divided by the product of their standard deviations (which are the square roots of their respective variances):

\[ r_{xy} = \frac{c_{xy}}{\sqrt{c_{xx}c_{yy}}} %= \frac{\sum (x_i-\overline{x})(y_i-\overline{y})}{\sqrt{ \sum (x_i-\overline{x})^2 \sum (y_i-\overline{y})^2 }} \]

The sample cross-correlation function: \[ r_{xy}(k) = \frac{c_{xy}(k)}{\sqrt{c_{xx}(0)c_{yy}(0)}} \]

\(c_{xx}\) and \(c_{yy}\) are the sample variances of \(x\) and \(y\), respectively.

\(c_{xx}(0)\) and \(c_{yy}(0)\) that are the sample variances of \(x_t\) and \(y_t\) respectively.

Correlation coefficients illustrated

“For descriptive purposes, the relationship will be described as strong if \(|r| \geq .8\), moderate if \(.5 < |r| <.8\), and weak if \(|r| \leq .5\).” – Devore and Berk, Modern Mathematical Statistics with Applications, 2012

Devore and Berk, _Modern Mathematical Statistics with Applications_, 2012, Figure 12.24

Anscombe’s quartet classically illustrates the pitfalls on relying on a single coefficient – always visualize your data. Consider the following four datasets:

All have similar statistical properties.

set mean of x mean of y variance of x variance of y correlation intercept slope
1 9 7.5 11 4.127 0.82 3 0.5
2 9 7.5 11 4.128 0.82 3 0.5
3 9 7.5 11 4.123 0.82 3 0.5
4 9 7.5 11 4.123 0.82 3 0.5

Cross (lagged) correlations illustrated

Illustration of lag for 20.07.2013 in Lausanne:

lag_example

  • Radiation initiates the photochemistry for ozone formation and is strongly linked to the concentration of atmospheric oxidants.
  • Increase in radiation also leads to increase in temperatures, which accelerate rates of reaction.
  • The ozone production is not instantaneous, hence the lag.

Could a similar relationship be confirmed by examining correlations between the daily maximum values of radiation and ozone?

Confounding interpretations

A correlation or lagged correlation in x and y may also be observed. For examples, a correlation between \(x\) and \(y\) may not be due to the causal relationship between \(x\) and \(y\), but dependent on a third variable, \(z\). This is written:

\[ \begin{aligned} y_t &= f_y(z_t)\\ x_t &= f_x(z_t) \end{aligned} \]

\[ \begin{aligned} y_{t+k} &= f_y(z_t)\\ x_{t} &= f_x(z_t) \end{aligned} \]

Correlation does not imply causation”:

  • The variation between two variables may depend on a third, as shown above.
  • Even if the variation in one variable is dependent on another, the correlation does not indicate the direction of causality.

R demonstration

library(dplyr)
library(reshape2)
library(chron)
library(ggplot2)
source("GRB001.R")
Sys.setlocale("LC_TIME","C")
options(stringsAsFactors=FALSE)
options(chron.year.abb=FALSE)
theme_set(theme_bw()) # just my preference for plots

Let us load data saved from Lesson 4.

df <- readRDS("data/2013/lau-zue.rds")
variables <- c("O3", "NO2", "CO", "PM10", "SO2", "NMVOC", "EC", "TEMP", "PREC", "RAD")

Pairwise correlations

Let us calculate the daily maximum:

lf <- melt(df, measure.vars=variables)
dm <- lf %>% group_by(site, year, month, day, season, variable) %>%
  summarize(value=max(value, na.rm=TRUE))
daily.max <- dcast(dm, site + year + month + day + season ~ variable)
rm(dm) # no longer needed

Recall the relationship between temperature and O3 shown in a previous lecture. Note the seasonal dependence of this relationship.

ggp <- ggplot(daily.max)+
  facet_grid(site~season)+
  geom_point(aes(TEMP,O3))
print(ggp)

We can examine the correlation of other pollutants or meterological variables with O3.

lf <- melt(daily.max, measure.vars=setdiff(variables, "O3")) # everything but ozone
ggp <- ggplot(lf)+
  facet_grid(site~variable, scale="free_x")+
  geom_point(aes(value, O3, group=season, color=season), shape=4)
print(ggp)

ggp <- ggplot(df)+
  facet_grid(site~season)+
  geom_point(aes(CO, NO2))
print(ggp)

For the following scatterplot matrix, we will use a package called lattice which is a plotting system that exists in parallel to R’s base graphics and ggplot2 package (confusing, I know). We additionally define a function to include correlation coefficients in our panels.

library(lattice)

Correlation.value <- function(x, y, ...) {
  correlation <- cor(x, y,use="pairwise")
  if(!is.na(correlation)) {
    cpl <- current.panel.limits()
    panel.text(mean(cpl$xlim),mean(cpl$ylim),
               bquote(italic(r)==.(sprintf("%.2f",correlation))),
               adj=c(0.5,0.5),col="blue")
  }
}

Plot daily maximum values as pairwise points (only Lausanne).

ix <- grepl("LAU", daily.max[["site"]], fixed=TRUE)
spp <- splom(~daily.max[ix,c("O3","NO2","CO","PM10","TEMP","PREC","RAD")] | daily.max[ix,"season"],
             upper.panel = Correlation.value,
             pch=4)
print(spp)

Plot hourly data as pairwise points. Given the large number of points, we can “smooth” the visual representation.

ix <- grepl("LAU", df[["site"]], fixed=TRUE)
spp <- splom(~df[ix,c("O3","NO2","CO","PM10","TEMP","PREC","RAD")] | df[ix,"season"],
             upper.panel = Correlation.value,
             panel = panel.smoothScatter,             
             pch=4)
print(spp)

Notice some values of correlation went up.

Lagged correlations

Lag <- function(pair, k) {
  out <- data.frame(k, head(pair[,1],-k), tail(pair[,2],-k))
  names(out) <- c("lag", colnames(pair))
  out
}

lagged <- df %>%
  group_by(site, season) %>%
    do(rbind(Lag(.[,c("RAD","O3")], 1),
             Lag(.[,c("RAD","O3")], 2),
             Lag(.[,c("RAD","O3")], 3),
             Lag(.[,c("RAD","O3")], 4),
             Lag(.[,c("RAD","O3")], 5),
             Lag(.[,c("RAD","O3")], 6)))
ggp <- ggplot(lagged) +
  geom_point(aes(RAD, O3, group=site, color=site), shape=4)+
    facet_grid(lag~season)
print(ggp)

R include a function, ccf, for calculating the cross correlation.

ccf(df[,"RAD"], df[,"O3"], lag.max=6, na.action=na.pass)

This tells us that the correlation is higest at a lag of -3.

We can wrap this in a function that retuns information that we want in a data frame:

LaggedCorrelation <- function(pair, ...) {
  out <- ccf(pair[,1], pair[,2], ..., na.action=na.pass, plot=FALSE)
  data.frame(lag=out[["lag"]], value=out[["acf"]])
}

lagged.values <- df %>% group_by(season) %>%
  do(LaggedCorrelation(.[,c("RAD","O3")], lag.max=6))

Note the syntax, ... in the function definition This passes on any additional arguments (lag.max in this case) from the outer function (LaggedCorrelation) to the inner-most function (ccf).

Plot:

ggp <- ggplot(lagged.values)+
  geom_segment(aes(x=lag,xend=lag,y=0,yend=value))+
  facet_grid(.~season)+
  xlab("lag (hours)")+
  ylab("Cross correlation coefficient")
print(ggp)